Basic introduction to statistics using R

Copy and paste the code on this webpage to RStudio Cloud, editing the values to match your data.


0. Getting started

Optional: Use RStudio online (free)

Sign up on RStudio Cloud

You can make an account on RStudio Cloud to use RStudio without downloading software. The service is free for a limited number of hours per month.

Create an R file to save work

Steps after opening RStudio

  1. Sign into RStudio Cloud.

  2. Click New Project > New RStudio Project.

  3. Click File > New File > R Script.

  4. Click the save icon and name your file.

Install and load libraries/packages

Add the code below to the top of the R file and click Source. The initial download may take a few minutes. Packages that are already installed are loaded without being downloaded again.

After the code has run, it should no longer be necessary to use library() or package:: in the remaining code.

The packages in quotes below are all the libraries referenced in this guide. Replace them as needed, using either single (') or double (") quotes.

packages = c('DT', 'dplyr', 'magrittr', 'kableExtra', 'ggplot2', 'plotly')

package.check <- lapply(
  packages,
  FUN = function(x) {
    if (!require(x, character.only = TRUE)) {
      install.packages(x, dependencies = TRUE)
      library(x, character.only = TRUE)
    }
  }
)



Data used in examples

The data used in the examples, mtcars, is a data set that ships with R, so it is available in RStudio without any download. It describes different car models and is shown below.

Please note: the data set is edited so that car names are included as the first column instead of row names. The row names are unchanged but not displayed.

# Add row names as an extra column (12th)
df <- mtcars
df$car <- rownames(df)

# Reorder columns so that the 12th (car) comes first
df <- df[,c(12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)]

# Display the data set without row names
DT::datatable(df, 
  caption = 'Pre-installed data set \'mtcars\'',
  rownames = FALSE)

The edits made to the data in the previous code block are necessary for the later examples that reference the car column, such as the bar chart.


1. Write data

After pasting the code into an R file, replace the comma-separated numbers with your data.

No variables/1 column

List of numbers

The data are provided as a list of numbers without any variables, or a table with only one column. In other words, only one ‘thing’ is being measured.

df <- c(1, 2, 3, 4, 5)



2 or more variables/columns

The letters in quotes should be replaced with the name of the column using either single or double quotes.

Table with 2 columns

df <- data.frame(
  'x' = c(1, 2, 3, 4, 5),
  'y' = c(6, 7, 8, 9, 10)
)



Table with 3 columns

df <- data.frame(
  'x' = c(1, 2, 3, 4, 5),
  'y' = c(6, 7, 8, 9, 10),
  'z' = c(11, 12, 13, 14, 15)
)

Table with 4 or more columns

Follow the pattern in the previous examples. Each column must have the same number of rows.

For example, if column x is 'x' = c(1, 2, 3) and column y is 'y' = c(11, 12, 13), both columns have 3 rows. If column z is 'z' = c(21, 22), this column could not be included with columns x and y.
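
The rule above can be checked with a short sketch in base R. The column names x, y, and z follow the example; wrapping the second call in tryCatch() is an addition made here so that the error message is captured instead of stopping the script.

```r
# Columns x and y both have 3 rows, so they combine without problems.
ok <- data.frame(
  'x' = c(1, 2, 3),
  'y' = c(11, 12, 13)
)
nrow(ok)

# Column z has only 2 rows, so data.frame() stops with an error.
# tryCatch() captures the error message instead of halting the script.
result <- tryCatch(
  data.frame(
    'x' = c(1, 2, 3),
    'y' = c(11, 12, 13),
    'z' = c(21, 22)
  ),
  error = function(e) conditionMessage(e)
)
print(result)
```

If result holds a message rather than a table, check that every column has the same number of values.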

Rename the columns of the data

Replace the letters x, y, or z with your column names, surrounded by quotation marks. In general, do not include spaces or special characters in the name, except dashes (-), underscores (_), and numbers (0-9). Note that data.frame() by default converts dashes and spaces to periods (.), so a column typed as 'good-name1' is stored as good.name1. Names are case-sensitive (name and Name are different). It is good practice to start each name with a lower-case letter.

# DO NOT RUN

df <- data.frame(
  'good_name' = c(),
  'good-name1' = c(),
  'goodName' = c(),
  
  'bad name' = c(),
  'Bad.Name' = c(),
  'Really Bad Name' = c()
)
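
To see how R stores column names, the sketch below builds two small tables in base R and prints their names. The values 1, 2, and 3 are placeholders chosen for illustration; check.names is an existing argument of data.frame().

```r
# By default, data.frame() passes column names through make.names(),
# which replaces characters that are not valid in R names with periods.
converted <- data.frame('good_name' = 1, 'good-name1' = 2, 'bad name' = 3)
names(converted)

# check.names = FALSE keeps the names exactly as typed; such columns
# must then be referenced with double brackets (or backticks).
kept <- data.frame('good-name1' = 2, 'bad name' = 3, check.names = FALSE)
names(kept)
kept[['bad name']]
```

Names made only of letters, numbers, and underscores come through unchanged in both cases, which is one reason they are the simplest choice.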




2. Calculate statistics

The output will appear in the Console panel in RStudio.

Measures of central tendency

Mean/Average

List df

mean(df)

Table df with column x

mean(df$x)

= 3

Median

List df

median(df)

Table df with column x

median(df$x)

= 3

Mode

See custom functions.

Measures of dispersion

Variance

List df

var(df)

Table df with column x

var(df$x)

= 2.5

Standard deviation

List df

sd(df)

Table df with column x

sd(df$x)

= 1.581139
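
As a check on the two results above, the sketch below recomputes the variance and standard deviation by hand for the same five numbers. The variable names are chosen here for illustration; var() and sd() use the sample formulas, which divide by n - 1.

```r
x <- c(1, 2, 3, 4, 5)

# Sample variance: squared deviations from the mean, divided by n - 1.
n <- length(x)
manual_var <- sum((x - mean(x))^2) / (n - 1)

# Standard deviation: square root of the variance.
manual_sd <- sqrt(manual_var)

manual_var  # 2.5, the same as var(x)
manual_sd   # about 1.581139, the same as sd(x)
```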

Standard error

See custom functions.

Custom functions

Copy and paste the code below into an R file, then click Source. To use the function, call it using the second code block.

Mode

# Sort values by frequency, in descending order (helper used by mode())
sort_mode <- function(x) {
  temp <- data.frame(table(x))
  temp <- temp[order(-temp$Freq),]
  
  rownames(temp) <- NULL
  names(temp) <- c('value', 'frequency')
  
  return(temp)
}

# Find the mode(s) using sort_mode(); note that this masks base R's mode()
mode <- function(x) {
  x <- sort_mode(x)
  rows_x <- nrow(x)
  max_freq <- max(x$frequency)
  
  x$is_max <- 0
  x$is_max[x$frequency==max_freq] <- 1
  
  x <- x[x$is_max==1,]
  x <- data.frame(as.numeric(as.character(x$value)))
  rownames(x) <- NULL
  
  # No mode: several different values all tie for the highest frequency
  if (nrow(x) == rows_x && rows_x > 1) {
    x = 'no mode'
  } else {
    names(x) <- c('mode(s):')
  }
  
  return(x)
}

List df

mode(df)

Table df for column x

mode(df$x)

= no mode

Standard error

se <- function(x) {
  temp <- round(sd(x)/sqrt(length(x)), digits=4)
  
  return(temp)
}

List df

se(df)

Table df for column x

se(df$x)

= 0.7071

Manually calculate r, r^2, a (y-intercept), and b (slope)

# Create table to manually calculate several statistics (r, r^2, a, and b)
make_table <- function(df) {
  require(DT)
  
  df$xy <- df$x*df$y
  df$x2 <- df$x**2
  df$y2 <- df$y**2
  
  all_sums <- c(sum(df$x), sum(df$y), sum(df$xy), sum(df$x2), sum(df$y2))
  df <- rbind(df, all_sums)
  rownames(df)[rownames(df)==as.character(nrow(df))] <- 'Total Sums'
  table <- DT::datatable(df,
    extensions = 'Buttons',
    caption = paste('Total observations: n = ', nrow(df)-1),
    options = list(
      dom = 'Bt',
      buttons = c('copy', 'csv', 'excel')
  ))
  return(table)
}

Table df for columns x and y

make_table(df)
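
The table supplies the column sums; the sketch below shows one way those sums could be turned into r, r^2, a, and b. It assumes the usual textbook computational formulas, which this guide does not spell out, and uses hypothetical y values stored in example_df so that the result is not a perfect line.

```r
# Hypothetical data: x as in the earlier examples, y chosen for illustration.
example_df <- data.frame(
  'x' = c(1, 2, 3, 4, 5),
  'y' = c(2, 4, 5, 4, 5)
)

# The sums that appear in the 'Total Sums' row of the table.
n      <- nrow(example_df)
sum_x  <- sum(example_df$x)
sum_y  <- sum(example_df$y)
sum_xy <- sum(example_df$x * example_df$y)
sum_x2 <- sum(example_df$x^2)
sum_y2 <- sum(example_df$y^2)

# Slope (b) and y-intercept (a) of the least-squares line.
b <- (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x^2)
a <- (sum_y - b * sum_x) / n

# Correlation coefficient (r) and coefficient of determination (r^2).
r <- (n * sum_xy - sum_x * sum_y) /
  sqrt((n * sum_x2 - sum_x^2) * (n * sum_y2 - sum_y^2))
r2 <- r^2

# Compare the manual results with R's built-in functions.
c(a = a, b = b, r = r, r2 = r2)
cor(example_df$x, example_df$y)
coef(lm(y ~ x, data = example_df))
```

The last two lines are only a cross-check: cor() should match r, and the coefficients from lm() should match a and b.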



Manually calculate continuous probabilities

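The data set uses decimals in the second column, which represent the probability p(x) of each value x. The probabilities of a complete distribution sum to 1; the placeholder values below sum to 1.5, so replace them with your own probabilities.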

df <- data.frame(
  'x' = c(1, 2, 3, 4, 5),
  'y' = c(0.1, 0.2, 0.3, 0.4, 0.5)
)
# Create table to manually calculate continuous probabilities
probability_table <- function(df) {
  df$mean <- df$x*df$y
  mean <- sum(df$mean)
  df <- subset(df, select=-c(mean))
  df$`x-m` <- df$x - mean
  df$`(x-m)^2` <- round(df$`x-m`**2, digits=3)
  df$`(x-m)^2*p(x)` <- round(df$`(x-m)^2`*df$y, digits=3)
  names(df)[names(df)=='y'] <- 'p(x)'
  
  var <- round(sum(df$`(x-m)^2*p(x)`), digits=3)
  sd <- round(sqrt(sum(df$`(x-m)^2*p(x)`)), digits=3)
  var_row <- c('', '', '', '', var)
  df <- rbind(df, var_row)
  sd_row <- c('', '', '', '', sd)
  df <- rbind(df, sd_row)
  rownames(df)[rownames(df)==as.character(nrow(df)-1)] <- 'Var.'
  rownames(df)[rownames(df)==as.character(nrow(df))] <- 'Std. Dev.'
  
  table <- DT::datatable(df,
    extensions = 'Buttons',
    caption = paste('See bottom rows for Variance (Var.) 
                    and Standard Deviation (Std. Dev.)'),
    options = list(
      dom = 'Bt',
      buttons = c('copy', 'csv', 'excel')
  ))
  return(table)
}

Table df for columns x and y

probability_table(df)
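
The columns of the table follow a short chain of calculations, sketched below in base R without the table formatting. The probabilities are hypothetical values chosen so that they sum to 1, and the variable names are illustrative.

```r
# Hypothetical distribution: the probabilities sum to 1.
x <- c(1, 2, 3, 4, 5)
p <- c(0.1, 0.2, 0.4, 0.2, 0.1)
stopifnot(isTRUE(all.equal(sum(p), 1)))

# Mean (m): each value weighted by its probability.
m <- sum(x * p)

# Variance: squared deviations from the mean, weighted by probability.
variance <- sum((x - m)^2 * p)

# Standard deviation: square root of the variance.
std_dev <- sqrt(variance)

c(mean = m, variance = variance, std_dev = std_dev)
```

For these values the mean is 3 and the variance is 1.2; small differences from the table can appear because the table rounds intermediate columns to three digits.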




3a. Make basic tables

For data with 2 or more columns only. All examples use the pre-installed data set ‘mtcars’.

Replace Title in each example with your own title, surrounded by quotation marks.

Standard table

Basic HTML table.

library(magrittr)

df%>%
  kableExtra::kbl(caption = 'Title')%>%
  kableExtra::kable_styling()
Title
                      mpg  cyl   disp   hp  drat     wt   qsec  vs  am  gear  carb
Mazda RX4            21.0    6  160.0  110  3.90  2.620  16.46   0   1     4     4
Mazda RX4 Wag        21.0    6  160.0  110  3.90  2.875  17.02   0   1     4     4
Datsun 710           22.8    4  108.0   93  3.85  2.320  18.61   1   1     4     1
Hornet 4 Drive       21.4    6  258.0  110  3.08  3.215  19.44   1   0     3     1
Hornet Sportabout    18.7    8  360.0  175  3.15  3.440  17.02   0   0     3     2
Valiant              18.1    6  225.0  105  2.76  3.460  20.22   1   0     3     1
Duster 360           14.3    8  360.0  245  3.21  3.570  15.84   0   0     3     4
Merc 240D            24.4    4  146.7   62  3.69  3.190  20.00   1   0     4     2
Merc 230             22.8    4  140.8   95  3.92  3.150  22.90   1   0     4     2
Merc 280             19.2    6  167.6  123  3.92  3.440  18.30   1   0     4     4
Merc 280C            17.8    6  167.6  123  3.92  3.440  18.90   1   0     4     4
Merc 450SE           16.4    8  275.8  180  3.07  4.070  17.40   0   0     3     3
Merc 450SL           17.3    8  275.8  180  3.07  3.730  17.60   0   0     3     3
Merc 450SLC          15.2    8  275.8  180  3.07  3.780  18.00   0   0     3     3
Cadillac Fleetwood   10.4    8  472.0  205  2.93  5.250  17.98   0   0     3     4
Lincoln Continental  10.4    8  460.0  215  3.00  5.424  17.82   0   0     3     4
Chrysler Imperial    14.7    8  440.0  230  3.23  5.345  17.42   0   0     3     4
Fiat 128             32.4    4   78.7   66  4.08  2.200  19.47   1   1     4     1
Honda Civic          30.4    4   75.7   52  4.93  1.615  18.52   1   1     4     2
Toyota Corolla       33.9    4   71.1   65  4.22  1.835  19.90   1   1     4     1
Toyota Corona        21.5    4  120.1   97  3.70  2.465  20.01   1   0     3     1
Dodge Challenger     15.5    8  318.0  150  2.76  3.520  16.87   0   0     3     2
AMC Javelin          15.2    8  304.0  150  3.15  3.435  17.30   0   0     3     2
Camaro Z28           13.3    8  350.0  245  3.73  3.840  15.41   0   0     3     4
Pontiac Firebird     19.2    8  400.0  175  3.08  3.845  17.05   0   0     3     2
Fiat X1-9            27.3    4   79.0   66  4.08  1.935  18.90   1   1     4     1
Porsche 914-2        26.0    4  120.3   91  4.43  2.140  16.70   0   1     5     2
Lotus Europa         30.4    4   95.1  113  3.77  1.513  16.90   1   1     5     2
Ford Pantera L       15.8    8  351.0  264  4.22  3.170  14.50   0   1     5     4
Ferrari Dino         19.7    6  145.0  175  3.62  2.770  15.50   0   1     5     6
Maserati Bora        15.0    8  301.0  335  3.54  3.570  14.60   0   1     5     8
Volvo 142E           21.4    4  121.0  109  4.11  2.780  18.60   1   1     4     2



Other themes

Use a scroll bar to limit the size of a table.

Replace 100 in 100% and 200 in 200px to change the size of the table.

kableExtra::kbl(cbind(df, df)) %>%
  kableExtra::kable_paper() %>%
  kableExtra::scroll_box(width = "100%", height = "200px")
(Output: the mtcars table with its 11 columns repeated side by side, displayed inside a scroll box 200px high.)



Highlight a specific row when hovering with a cursor.

library(magrittr)

df %>%
  kableExtra::kbl(caption = 'Title') %>%
  kableExtra::kable_paper("hover")
(Output: the same table as in the Standard table example, with the row under the cursor highlighted.)



Highlight a row when hovering, but do not stretch the table to the full width of the page.

library(magrittr)

df %>%
  kableExtra::kbl(caption = 'Title') %>%
  kableExtra::kable_paper("hover", full_width = FALSE)
(Output: the same table as in the Standard table example, with hover highlighting and a width that fits the content instead of the page.)

Additional information can be found in the kableExtra package documentation.


3b. Make interactive tables

For data with 2 or more columns only. Lists will not work.

Standard table

Replace Title with a title for the table.

DT::datatable(df, 
  rownames = FALSE,
  caption = paste('Title')
  )



With download buttons

DT::datatable(df,
  rownames = FALSE,
  extensions = 'Buttons',
  caption = paste('Title'),
  options = list(
  dom = 'Bt',
  buttons = c('copy', 'csv', 'excel')
  ))



With column filters

DT::datatable(df,
  rownames = FALSE,
  extensions = 'Buttons',
  caption = paste('Title'),
  options = list(
  dom = 'Bt'),
  filter = 'top')



Other themes

Replace Title with a title for the table.

DT::datatable(df, 
  rownames = FALSE,
  caption = paste('Title'),
  class = 'cell-border stripe'
  )

With download buttons

DT::datatable(df,
  rownames = FALSE,
  extensions = 'Buttons',
  caption = paste('Title'),
  class = 'cell-border stripe',
  options = list(
  dom = 'Bt',
  buttons = c('copy', 'csv', 'excel')
  ))

Additional information can be found in the DT package documentation.


4. Statistical figures

The following section references columns within data sets. To reference a specific column for a data set df, we use a dollar sign $ afterwards and write the name of the column:

# DO NOT RUN
df$name_of_column
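
A runnable version of the pattern above, as a small sketch using mtcars directly. The name car_data is chosen here for illustration and is not used elsewhere in this guide.

```r
# mtcars ships with R, so this runs without any extra packages.
car_data <- mtcars

# The dollar sign returns the column as a vector of values.
car_data$mpg

# Double brackets with the name in quotes return the same vector.
identical(car_data$mpg, car_data[['mpg']])

# A column can be passed straight to a function.
mean(car_data$mpg)
```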

The column names of the data set used for this example can be found below.

## car mpg cyl disp hp drat wt qsec vs am gear carb



Scatter plot

Replace mpg and wt with the column names of quantitative data. Note that each column name is preceded with a tilde ~ without a space. Replace Title with a title for the figure.

library(magrittr)

plotly::plot_ly(
  data = df,
  x = ~wt, 
  y = ~mpg)%>%
  plotly::layout(title='Title')



Scatter plot with a legend

Replace mpg and wt with the column names of quantitative data. Note that each column name is preceded with a tilde ~ without a space. Replace Title with a title for the figure.

Continuous variable

Replace cyl with a column that will determine the color each point receives. If the data is numerical, it will create a gradient by default.

library(magrittr)

plotly::plot_ly(
  data = df,
  x = ~wt, 
  y = ~mpg,
  color = ~cyl)%>%
  plotly::layout(title='Title')



Discrete variable

Although the column cyl is numerical, only three unique values appear (4, 6, and 8). Wrapping the column in factor() treats it as qualitative data, where each unique number is a group.

Replace cyl with qualitative data, including numerical data with discrete values only.

library(magrittr)

plotly::plot_ly(
  data = df,
  x = ~wt, 
  y = ~mpg,
  color = ~factor(cyl))%>%
  plotly::layout(title='Title')
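
To see what factor() does to the column before it reaches the plot, the sketch below inspects cyl in base R. The name cyl_groups is illustrative.

```r
# As plain numbers, cyl is treated as a value on a numeric scale.
class(mtcars$cyl)

# Wrapped in factor(), each distinct value becomes a named group (level).
cyl_groups <- factor(mtcars$cyl)
levels(cyl_groups)

# Number of cars in each group.
table(cyl_groups)
```

Each level receives its own color in the plot and its own entry in the legend.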



Bar chart

library(magrittr)

plotly::plot_ly(
  data = df,
  x = ~car, 
  y = ~mpg,
  type = 'bar')%>%
  plotly::layout(title='Title')



Pareto chart

library(magrittr)

plotly::plot_ly(
  data = df,
  x = ~reorder(car, -mpg), 
  y = ~mpg,
  type = 'bar')%>%
  plotly::layout(title='Title')%>%
  plotly::layout(xaxis = list(title='car'))
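
The only change from the bar chart is the reorder() call on the x-axis. The sketch below shows, in base R, what that call returns; car_data and ordered_cars are names chosen here for illustration.

```r
# Car names as a column, as in the data set used throughout this guide.
car_data <- mtcars
car_data$car <- rownames(car_data)

# reorder() returns a factor whose levels are sorted by the second argument.
# Negating mpg sorts from the highest value to the lowest.
ordered_cars <- reorder(car_data$car, -car_data$mpg)

# The first levels are the cars with the highest mpg.
head(levels(ordered_cars), 3)
```

plotly draws the bars in the order of the factor levels, which is why the tallest bar appears first.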



Box-and-whisker plot

plotly::plot_ly(data = df, y=~mpg, type = 'box', hoverinfo = 'y', name = '')%>%
  plotly::layout(title = 'Title')



Histogram

# Histogram of counts
plotly::plot_ly(data=df, x=~mpg, type='histogram')
# Histogram of proportions (each bar shows the share of observations)
plotly::plot_ly(data=df, x=~mpg, type='histogram', histnorm='probability')



Additional information

Additional information can be found in the plotly for R documentation.


5. Regression models

Using the data mtcars, create an ordinary least-squares (OLS) regression model where the miles per gallon (mpg) of each car is a function of its weight (wt).

Create an OLS regression model

\[miles\ per\ gallon = \alpha + \beta(weight) + \epsilon\]

\[mpg = \alpha + \beta \cdot wt + \epsilon\]

lm(mpg~wt, data=df)
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Coefficients:
## (Intercept)           wt  
##      37.285       -5.344



Summarize a model

summary(lm(mpg~wt, data=df))
## 
## Call:
## lm(formula = mpg ~ wt, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5432 -2.3647 -0.1252  1.4096  6.8727 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446 
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10
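
Individual numbers can also be pulled out of the summary instead of being read off the printout. The sketch below uses mtcars directly; the names model and model_summary are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)
model_summary <- summary(model)

# Coefficient table: estimates, standard errors, t values, and p-values.
coef(model_summary)

# R-squared: the share of the variation in mpg explained by wt.
model_summary$r.squared

# With a single predictor, R-squared equals the squared correlation.
cor(mtcars$wt, mtcars$mpg)^2
```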



From the results, the following models are created (coefficients rounded to two decimal places):

OLS model:

\[mpg = 37.29 - 5.34 \cdot wt + \epsilon\]

Fitted model:

\[\hat{\text{MPG}} = 37.29 - 5.34 \cdot \text{WEIGHT}\]

Using the fitted model, we predict the miles per gallon (\(\hat{\text{MPG}}\)) of a car from its weight (\(\text{WEIGHT}\)) by plugging in the weight of the car and solving the equation. In other words, multiply the weight by 5.34 and subtract the product from 37.29. Only the predicted value carries a hat, because the weight is an observed input rather than an estimate.
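
The prediction step can be sketched in R in two ways: by hand from the coefficients, and with predict(). The car weighing 3 is a hypothetical example (wt in mtcars is recorded in units of 1000 lbs), and the variable names are illustrative.

```r
model <- lm(mpg ~ wt, data = mtcars)

# Hypothetical car with wt = 3 (wt is recorded in units of 1000 lbs).
new_car <- data.frame(wt = 3)

# By hand: intercept plus slope times weight (the slope is negative).
by_hand <- coef(model)[['(Intercept)']] + coef(model)[['wt']] * new_car$wt

# With predict(), which applies the same equation.
predicted <- predict(model, newdata = new_car)

by_hand
unname(predicted)
```

Both values come out at roughly 21.25 miles per gallon; using the unrounded coefficients avoids the small error introduced by rounding.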

Draw the line of best fit

Basic scatter plot from the data set.

plot(x=df$wt, y=df$mpg)

Draw the regression line over the scatter plot.

plot(x=df$wt, y=df$mpg)
abline(lm(mpg~wt, data=df))

Change the color of the regression line and change the labels of the scatter plot.

plot(x=df$wt, y=df$mpg,
     main='Title',
     xlab='x-axis',
     ylab='y-axis')

abline(lm(mpg~wt, data=df), col='red')

Change the points of the scatter plot (pch accepts the symbol codes 0 to 25; option 19 is a filled-in circle).

# pch accepts the values 0 - 25
plot(x=df$wt, y=df$mpg,
     main='Title',
     xlab='x-axis',
     ylab='y-axis',
     pch=19)

abline(lm(mpg~wt, data=df), col='red')



Group the observations

plot(x=df$wt, y=df$mpg, col=factor(df$cyl),
     main='Title',
     xlab='x-axis',
     ylab='y-axis',
     pch=19)

legend("topright",
       title='Legend title',
       legend=levels(factor(df$cyl)), 
       pch=19,
       # Colors follow the order of the levels, matching col=factor(df$cyl) in plot()
       col=seq_along(levels(factor(df$cyl))))

abline(lm(mpg~wt, data=df), col='red')